This project began as an extension of a DataCamp course on business analytics using Microsoft Excel. The course used an Excel workbook with global sales and client data for athletic wear and taught how to create basic graphs and tables in Excel. After completing the course, I challenged myself to see if I could recreate the graphs and tables using RStudio, which is my analytics tool of choice.
After completing this challenge, I went a step further, creating interactive graphs and global heat maps of sales to offer even more insight to a potential colleague, boss, or investor. Below are the various graphs and tables, with descriptions of what I created and why I thought it was important.
The data used for the following analytics comes from a DataCamp course on Microsoft Excel, which can be found here. I completed the first chapter of the course (which is free to complete) and which walks through the data cleaning, sheet renaming, and sheet reorganization necessary before following along with my analytics below. After completing that chapter, I loaded the "Orders" sheet from the DataCamp workbook to conduct my analytics.
# load in pacman
library(pacman)
# load in necessary packages
p_load(readxl,janitor,ggplot2,plotly,ggthemes,dplyr)
# load in orders sheet from DataCamp Workbook
sales_data <- read_excel("~/github_repos/projects-and-work-samples/athletic_wear_sales_analytics_project_files/data/sales_data.xlsx",
sheet = "Orders")
sales_data

## # A tibble: 1,269 × 24
## Order I…¹ Order…² `Order Date` Days …³ Late_…⁴ Shipp…⁵ Categ…⁶ Custo…⁷
## <dbl> <dbl> <dttm> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 6176 2015 2015-01-04 00:00:00 4 0 Standa… 17 3329
## 2 10384 2015 2015-01-06 00:00:00 4 1 Standa… 17 587
## 3 14551 2015 2015-01-08 00:00:00 2 0 Second… 17 2028
## 4 14574 2015 2015-01-08 00:00:00 2 1 Second… 18 6594
## 5 22924 2015 2015-01-12 00:00:00 2 1 Second… 29 9704
## 6 906 2015 2015-01-14 00:00:00 4 0 Standa… 40 7141
## 7 973 2015 2015-01-15 00:00:00 4 0 Standa… 29 5118
## 8 1077 2015 2015-01-16 00:00:00 2 1 Second… 9 8103
## 9 1105 2015 2015-01-17 00:00:00 4 1 Standa… 37 9760
## 10 1186 2015 2015-01-18 00:00:00 4 0 Standa… 9 11947
## # … with 1,259 more rows, 16 more variables: `Department Id` <dbl>,
## # `Department Name` <chr>, Market <chr>, `Order City` <chr>,
## # `Order State` <chr>, `Order Zipcode` <dbl>, `Order Country` <chr>,
## # `Order Region` <chr>, `Product Category Id` <dbl>, `Product Id` <dbl>,
## # `Product Price` <dbl>, `Product Cost` <dbl>, `Order Quantity` <dbl>,
## # `Order Total Discount` <dbl>, Sales <dbl>, `Payment Type` <chr>, and
## # abbreviated variable names ¹`Order Id`, ²`Order Year`, …
The data above contains all sorts of categorical and numerical data
regarding individual orders of athletic wear. Lots of potential for
analytics! Next, I will clean the data a bit, focusing on the column
titles and making them more R-friendly (meaning lowercase letters and
underscores instead of spaces). I will use the janitor
package to quickly clean the column names with its
clean_names() function.
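The cleaning call itself is not shown above; a minimal version, assuming the sales_data object loaded earlier, would be:

```r
# load the janitor package for name cleaning
library(janitor)

# convert all column names to lowercase snake_case
sales_data <- clean_names(sales_data)
```

clean_names() also strips punctuation and other special characters from names, so the result is always a valid, unquoted R identifier.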
## # A tibble: 1,269 × 24
## order_id order_…¹ order_date days_…² late_…³ shipp…⁴ categ…⁵ custo…⁶
## <dbl> <dbl> <dttm> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 6176 2015 2015-01-04 00:00:00 4 0 Standa… 17 3329
## 2 10384 2015 2015-01-06 00:00:00 4 1 Standa… 17 587
## 3 14551 2015 2015-01-08 00:00:00 2 0 Second… 17 2028
## 4 14574 2015 2015-01-08 00:00:00 2 1 Second… 18 6594
## 5 22924 2015 2015-01-12 00:00:00 2 1 Second… 29 9704
## 6 906 2015 2015-01-14 00:00:00 4 0 Standa… 40 7141
## 7 973 2015 2015-01-15 00:00:00 4 0 Standa… 29 5118
## 8 1077 2015 2015-01-16 00:00:00 2 1 Second… 9 8103
## 9 1105 2015 2015-01-17 00:00:00 4 1 Standa… 37 9760
## 10 1186 2015 2015-01-18 00:00:00 4 0 Standa… 9 11947
## # … with 1,259 more rows, 16 more variables: department_id <dbl>,
## # department_name <chr>, market <chr>, order_city <chr>, order_state <chr>,
## # order_zipcode <dbl>, order_country <chr>, order_region <chr>,
## # product_category_id <dbl>, product_id <dbl>, product_price <dbl>,
## # product_cost <dbl>, order_quantity <dbl>, order_total_discount <dbl>,
## # sales <dbl>, payment_type <chr>, and abbreviated variable names
## # ¹order_year, ²days_for_shipment_scheduled, ³late_delivery_risk, …
Now the column headers are all lowercase and have underscores instead of spaces!
Now that the data is clean and easier to work with, I want to see how total annual sales have changed in certain markets. First, I will look at sales in Africa.
sales_data %>%
# filter for only observations in Africa
filter(market=="Africa") %>%
# group by order year so we can plot one point for each year
group_by(order_year) %>%
# create a sum of the 'sales' column after grouping by each year to obtain the total sales in each year
mutate(total_annual_sales = sum(sales)) %>%
# create graph
ggplot(mapping = aes(x = order_year,
y = total_annual_sales)) +
geom_point(color = "firebrick", size = 3) +
geom_line(color = "firebrick", linewidth = 1.5) +
# use 'ggthemes' package to add a professional looking theme to the graph
theme_stata() +
# add labels and title
labs(title = "Annual Sales in Africa",
x = "Year",
       y = "Total Annual Sales (USD)")

Looks like Africa only has two years' worth of data, in 2016 and 2017,
and it’s clear that sales declined over that time period. I can’t tell
from this graph what the exact sales values are, but I can venture an
educated guess. If I want to interact with the graph, I can wrap the
ggplotly function from the plotly package
around my ggplot() call from above to see exactly what the
values are for each of the two data points.
ggplotly(sales_data %>%
# filter for only observations in Africa
filter(market=="Africa") %>%
# group by order year so we can plot one point for each year
group_by(order_year) %>%
# create a sum of the 'sales' column after grouping by each year to obtain the total sales in each year
mutate(total_annual_sales = sum(sales)) %>%
# create graph
ggplot(mapping = aes(x = order_year,
y = total_annual_sales)) +
geom_point(color = "firebrick", size = 3) +
geom_line(color = "firebrick", linewidth = 1.5) +
# use 'ggthemes' package to add a professional looking theme to the graph
theme_stata() +
# add labels and title
labs(title = "Annual Sales in Africa",
x = "Year",
                y = "Total Annual Sales (USD)"))

The graph looks exactly the same, but if you hover over the data points you can see the actual total sales amount for 2016 and 2017. Now I see that sales fell from $40,484.64 in 2016 to $6,477.34 in 2017.
Next, I want to look at the annual sales in Europe. First, I’ll start
with the standard ggplot() graph.
sales_data %>%
# filter for European observations
filter(market == "Europe") %>%
# group by 'order_year' to calculate total sales per each year
group_by(order_year) %>%
# Create total_sales value that is sum of sales in each year
mutate(total_annual_sales = sum(sales)) %>%
# create graph
ggplot(aes(x = order_year,
y = total_annual_sales)) +
geom_point(color = "steelblue3",
size = 3) +
geom_line(color = "steelblue3",
linewidth = 1.5) +
# add theme using 'ggthemes' package
theme_stata() +
# add labels
labs(title = "Annual Sales in Europe",
x = "Year",
       y = "Total Annual Sales")

Europe has one more year's worth of data than Africa, with data from
2015-2017, and shows more volatility over that time period. Like I
did with the Africa graph, I will use the ggplotly()
function here to observe the sales values from each year.
ggplotly(sales_data %>%
# filter for European observations
filter(market == "Europe") %>%
# group by 'order_year' to calculate total sales per each year
group_by(order_year) %>%
# Create total_sales value that is sum of sales in each year
mutate(total_annual_sales = sum(sales)) %>%
# create graph
ggplot(aes(x = order_year,
y = total_annual_sales)) +
geom_point(color = "steelblue3",
size = 3) +
geom_line(color = "steelblue3",
                     linewidth = 1.5) +
           # add theme using 'ggthemes' package
           theme_stata() +
           # add labels
           labs(title = "Annual Sales in Europe",
                x = "Year",
                y = "Total Annual Sales"))
With the interactive graph above, I see that sales in Europe totaled $21,717.55 in 2015, fell to $2,964.55 in 2016, and finally rose to $36,172.24 in 2017.
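Both market graphs rely on the same pattern: group by year and sum sales. The same numbers can also be pulled into a single comparison table with summarise(), which returns one row per group rather than repeating the total on every row the way mutate() does. A sketch, assuming the cleaned sales_data from above:

```r
library(dplyr)

# one row per market/year combination, holding the summed sales
annual_totals <- sales_data %>%
  filter(market %in% c("Africa", "Europe")) %>%
  group_by(market, order_year) %>%
  summarise(total_annual_sales = sum(sales), .groups = "drop")

annual_totals
```

This kind of compact table is handy to keep alongside the graphs when a reader wants the exact figures without hovering.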
The next analysis I want to conduct is a graph of all sales in each department, to get a better idea of which departments can generate the most revenue for this company per order. I will create a bar graph that visualizes this data, opting this time to show only the interactive plot so I can observe the exact sales value of each order from each department.
# wrap in ggplotly() to make interactive
ggplotly(sales_data %>%
           # use reorder() in aes() to put bars in desc. order (-sales means desc. order by the sales variable)
           ggplot(aes(reorder(department_name, -sales),
                      sales)) +
           # add bars. "stat = 'identity'" sets the y axis to the sales variable instead of the default count
           geom_bar(stat = "identity", fill = "skyblue3", position = 'dodge') +
           theme_stata() +
           labs(title = "Revenue Values by Department",
                y = "Sales (USD)",
                x = "Department"))

This graph offers a lot in terms of actionable insights. By moving the cursor inside each bar, I can see the dollar value of each order from that particular department. Clearly, the technology department can offer the most in terms of revenue, as it has the most expensive orders in all the data, with fitness a distant second. Before making a decision on revenue, however, I will next create a bar graph detailing the total sales per department and see whether the results agree with this graph.
To tell which department is responsible for the most sales, or
revenue, I will need to calculate the total sales by department
first, then create a graph similar to the ones above, using the
newly created total_department_sales value instead of the raw
sales value.
# create new column for total sales grouped by department
sales_data <- sales_data %>%
group_by(department_name) %>%
mutate(total_department_sales = sum(sales))
# create graph
ggplotly(
ggplot(sales_data,
aes(department_name,total_department_sales)) +
geom_bar(stat = 'identity',
position = 'dodge',
fill = 'palevioletred') +
theme_stata() +
labs(title = "Total Revenue by Department",
y = "Total Sales (USD)",
x = "Department")
)

Now this graph tells a very different story from the previous one. In the previous graph, we saw that the technology department had the highest order values of any department, but this new graph shows that the technology department accounts for very little in terms of total revenue. Apparel is by far bringing in the most revenue, with Fitness, Footwear, and Golf all bringing in similar, smaller amounts. So, even though the technology department has some of the most expensive orders, those are only a handful of orders that, compared to the massive quantity of apparel orders, are dwarfed in terms of revenue generated.
Now that we have seen the individual and total sales values across departments, let’s add another element, the market, to see how departmental sales vary geographically. This will provide us with insight as to where geographically each department is generating the most revenue.
# wrap in ggplotly() to create interactive graph
ggplotly(
sales_data %>%
# group by market
group_by(market, department_name) %>%
# create total sales value by summing sales for each market
summarise(total_market_sales = sum(sales)) %>%
# graph total sales by market
ggplot(aes(x = market, y = total_market_sales, fill = department_name)) +
geom_bar(stat = "identity",
position = "dodge") +
labs(title = "Total Department Sales by Market",
y = "Total Sales (USD)",
x = "Market",
fill = "Department Name") +
scale_fill_colorblind()
)

## `summarise()` has grouped output by 'market'. You can override using the
## `.groups` argument.
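The message above is informational only: summarise() returned a result still grouped by market. To silence it and drop the grouping explicitly, the .groups argument can be passed directly, as in this sketch:

```r
library(dplyr)

# .groups = "drop" returns an ungrouped tibble and suppresses the message
sales_data %>%
  group_by(market, department_name) %>%
  summarise(total_market_sales = sum(sales), .groups = "drop")
```

Dropping the grouping is also safer for any downstream mutate() or filter() calls, which would otherwise operate within each market group.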